One Sense Per Discourse

Authors

  • William A. Gale
  • Kenneth Ward Church
  • David Yarowsky
Abstract

It is well known that there are polysemous words like sentence whose "meaning" or "sense" depends on the context of use. We have recently reported on two new word-sense disambiguation systems, one trained on bilingual material (the Canadian Hansards) and the other trained on monolingual material (Roget's Thesaurus and Grolier's Encyclopedia). As this work was nearing completion, we observed a very strong discourse effect. That is, if a polysemous word such as sentence appears two or more times in a well-written discourse, it is extremely likely that all of its occurrences will share the same sense. This paper describes an experiment which confirmed this hypothesis and found that the tendency to share sense in the same discourse is extremely strong (98%). This result can be used as an additional source of constraint for improving the performance of the word-sense disambiguation algorithm. In addition, it could also be used to help evaluate disambiguation algorithms that did not make use of the discourse constraint.

1. Our Previous Work on Word-Sense Disambiguation

1.1 Data Deprivation

Although there has been a long history of work on word-sense disambiguation, much of the work has been stymied by difficulties in acquiring appropriate testing and training materials. AI approaches have tended to focus on "toy" domains because of the difficulty in acquiring large lexicons. So too, statistical approaches, e.g., Kelly and Stone (1975), Black (1988), have tended to focus on a relatively small set of polysemous words because they have depended on extremely scarce hand-tagged materials for use in testing and training.

We have achieved considerable progress recently by using a new source of testing and training materials and the application of Bayesian discrimination methods. Rather than depending on small amounts of hand-tagged text, we have been making use of relatively large amounts of parallel text, text such as the Canadian Hansards, which are available in multiple languages. The translation can often be used in lieu of hand-labeling. For example, consider the polysemous word sentence, which has two major senses: (1) a judicial sentence, and (2) a syntactic sentence. We can collect a number of sense (1) examples by extracting instances that are translated as peine, and we can collect a number of sense (2) examples by extracting instances that are translated as phrase. In this way, we have been able to acquire a considerable amount of testing and training material for developing and testing our disambiguation algorithms.

1.2 Bayesian Discrimination

Surprisingly good results can be achieved using Bayesian discrimination methods, which have been used very successfully in many other applications, especially author identification (Mosteller and Wallace, 1964) and information retrieval (IR) (Salton, 1989, section 10.3). Our word-sense disambiguation algorithm uses the words in a 100-word context surrounding the polysemous word very much like the other two applications use the words in a test document.
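
The two ideas above, using the French translation as a stand-in for a hand label and scoring senses from context words with a Bayesian model, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`label_from_translation`, `train`, `classify`), the toy contexts, and the add-one smoothing are all assumptions made for the example, and the contexts here are a handful of words rather than the 100-word window described in the text.

```python
import math
from collections import Counter

# Hypothetical mapping from the French translation of "sentence" to a sense
# label, following the peine/phrase example in the text.
TRANSLATION_TO_SENSE = {"peine": "judicial", "phrase": "syntactic"}

def label_from_translation(french_word):
    """Return the sense implied by the translation, or None if unknown."""
    return TRANSLATION_TO_SENSE.get(french_word)

def train(aligned_examples):
    """Build per-sense word counts from (context_words, french_word) pairs."""
    word_counts = {}          # sense -> Counter of context words
    sense_counts = Counter()  # sense -> number of training examples
    vocab = set()
    for context_words, french_word in aligned_examples:
        sense = label_from_translation(french_word)
        if sense is None:
            continue  # skip instances whose translation gives no label
        sense_counts[sense] += 1
        word_counts.setdefault(sense, Counter()).update(context_words)
        vocab.update(context_words)
    return word_counts, sense_counts, vocab

def classify(context_words, model):
    """Pick the sense with the highest log posterior under a naive Bayes model.

    Add-one smoothing is an assumption of this sketch; words never seen in
    training are ignored.
    """
    word_counts, sense_counts, vocab = model
    total_examples = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total_examples)  # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context_words:
            if word in vocab:
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy "parallel text": context words around "sentence" plus its translation.
aligned = [
    (["judge", "court", "prison", "years"], "peine"),
    (["court", "jail", "convicted", "judge"], "peine"),
    (["grammar", "verb", "noun", "word"], "phrase"),
    (["word", "clause", "verb", "paragraph"], "phrase"),
]
model = train(aligned)
print(classify(["the", "judge", "prison"], model))
print(classify(["verb", "noun"], model))
```

On this toy data, contexts mentioning judge or prison come out as the judicial sense and contexts mentioning verb or noun as the syntactic sense. A realistic system would train on far larger aligned corpora and wider contexts.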

Similar resources

One Translation Per Discourse

We revisit the one sense per discourse hypothesis of Gale et al. in the context of machine translation. Since a given sense can be lexicalized differently in translation, do we observe one translation per discourse? Analysis of manual translations reveals that the hypothesis still holds when using translations in parallel text as sense annotation, thus confirming that translational differences ...

"One Entity per Discourse" and "One Entity per Collocation" Improve Named-Entity Disambiguation

The “one sense per discourse” (OSPD) and “one sense per collocation” (OSPC) hypotheses have been very influential in Word Sense Disambiguation. The goal of this paper is twofold: (i) to explore whether these hypotheses hold for entities, that is, whether several mentions in the same discourse (or the same collocation) tend to refer to the same entity or not, and (ii) test their impact in Named-...

Automatic Resolution of Ambiguous Abbreviations in Biomedical Texts using Support Vector Machines and One Sense Per Discourse Hypothesis

We present an algorithm to disambiguate abbreviations in Medline abstracts using Support Vector Machines (SVM) and one sense per discourse hypothesis. In contrast to other work using SVM for natural language disambiguation which always depend on handcrafted training and testing data, the algorithm provided here automatically extracts the training and testing data through searching long form of ...

More than One Sense Per Discourse

Previous research has indicated that when a polysemous word appears two or more times in a discourse, it is extremely likely that they will all share the same sense [Gale et al. 92]. However, those results were based on a coarse-grained distinction between senses (e.g., sentence in the sense of a 'prison sentence' vs. a 'grammatical sentence'). We report on an analysis of multiple senses within ...

Improving Word Sense Disambiguation in Lexical Chaining

Previous algorithms to compute lexical chains suffer either from a lack of accuracy in word sense disambiguation (WSD) or from computational inefficiency. In this paper, we present a new linear-time algorithm for lexical chaining that adopts the assumption of one sense per discourse. Our results show an improvement over previous algorithms when evaluated on a WSD task.

Combined One Sense Disambiguation of Abbreviations

A process that attempts to solve abbreviation ambiguity is presented. Various context-related features and statistical features have been explored. Almost all features are domain independent and language independent. The application domain is Jewish Law documents written in Hebrew. Such documents are known to be rich in ambiguous abbreviations. Various implementations of the one sense per discou...

Publication date: 1992